This website has three main purposes:
It provides the results that we have obtained at each stage of the analysis.
It allows readers to reuse our code and reproduce our results. In our view, reproducibility is a major limitation of the majority of the published literature: most studies do not share their code, and it is often unclear what decisions were made in data preprocessing and modeling.
The data was provided from the UNOS registry by staff at the United Network for Organ Sharing (UNOS). Per UNOS regulations, we cannot share their data; to obtain the data, the reader should click here. We also highly recommend that the reader becomes familiar with the data dictionary (see Data Definitions for details).
To navigate this site, please use the navigation bar on the left. The reader can reveal any code chunk by clicking on its Code button. We hide the code by default since we: (a) wanted to improve the readability of this document; and (b) assumed that readers will not be interested in reading every code chunk.
The snippet below documents the R packages and functions used in this research. For convenience, we use the pacman package since it allows installing and loading the needed packages in one step.
rm(list = ls()) # clear global environment
graphics.off() # close all graphics
if(require(pacman)==FALSE) install.packages("pacman") # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(DT, foreach, haven, magrittr, readxl, tidyverse, # important analytic packages
dataPreparation, DataExplorer, # important for exploratory data analysis
caret, lsmeans, emmeans, car, lmtest, yhat, sjPlot, sjmisc, ggplot2, effects, conflicted
)
conflict_prefer("recode", "dplyr")
setwd("H:/Shared drives/transplantation/Paper 2/Code/ISR Code/Markdown") # setting the working directory
source("custom_functions.R") # Has our custom functions
df = read_sas("../Data/thoracic_data.sas7bdat")
df$ID = row.names(df) # Index to be used in creating/tracking the training and test samples
orgTable = table(df[, 'WL_ORG'])
df %<>% dplyr::filter(WL_ORG == "HR") # filtering to keep only heart transplants
Our original UNOS data consisted of 159318 observations, which comprised: (a) 428 unspecified transplant types, (b) 3148 combined heart-lung transplants, (c) 103570 heart transplants, and (d) 52172 lung transplants. Given our focus on heart transplants, we retained only the observations where a heart transplant alone was performed. The resulting data is saved into our data.frame titled df, which now consists of 103570 observations and 495 variables (the variables are unchanged by the filtering operation).
naVec = is.na(df) %>% colSums() %>%
`/`(nrow(df)) # To make it percent missing per variable
names(naVec) = colnames(df)
naVec.quantiles = quantile(naVec) %>% round(2)
ggplot() + geom_histogram(aes(naVec), bins = 10) +
xlab("Fraction Data Missing Per Variable") + ylab("Number of Variables")+theme_minimal()
Based on the heart transplantation data, 238 of the 495 variables have at least one missing data point. Furthermore, we have computed the percent of missing data per variable; the first, second, and third quartiles (in terms of percent of data missing per column) correspond to 0%, 0%, and 60%, respectively. Thus, it is imperative to examine how these variables can be preprocessed prior to any analysis.
In the plot below, we sample 30 columns at random from the UNOS dataset to show the actual percentage of data missing for each variable. Colors denote the data quality of each column using a traffic-light scheme (where green is good and red is bad).
heart.na.plot = df[,sample(colnames(df), 30)] %>% plot_missing()
In the chunk below, we report the percent missing for each variable. The printout is arranged in descending order of missing values.
data.frame(variableName = names(naVec), percentMissing = round(naVec*100, 2), row.names = NULL) %>%
arrange(desc(percentMissing)) %>% datatable()
R initially assigns the columns to different types. We summarize these in the table below.
types = dataTypes(df) # a custom function; see custom_functions.R
numeric.index = which(types$classes.df=="numeric")
character.index = which(types$classes.df=="character")
datatable(types) # printing the different types
The reader should note that R reads any variable that consists purely of numeric values as numeric. However, from a data analysis perspective, many of these variables will need to be converted into factors since they represent factor levels rather than numeric values (e.g., donor/recipient ethnicity categories, allocation type, HLA mismatch level, education, etc.). Thus, the numeric variables should be examined more closely to ensure that they are properly coded by the software.
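As an illustrative check (not part of the original pipeline), numeric columns with only a handful of distinct values are likely categorical codes rather than true measurements. The sketch below flags such columns; the threshold of 10 distinct values is an arbitrary choice made for this example:

```r
# Flag numeric columns with few distinct values; these are candidate factors.
# (The threshold of 10 is arbitrary and purely illustrative.)
suspects <- df %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), n_distinct)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_distinct") %>%
  filter(n_distinct <= 10) %>%
  arrange(n_distinct)
head(suspects)
```

In practice, we rely on the STAR documentation file (next subsection) rather than such heuristics to decide which variables are categorical.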
In this section, we document in detail the data preparation procedures used in the study.
In this subsection, we capitalize on the file titled “STAR File Documentation.xls” to automatically check and convert any variable that was assigned an incorrect type by R. The reader should note that the outcome of this data transformation step can be referred to as technically correct data (De Jonge and Van Der Loo 2013). Note that we convert all factor columns to character to facilitate their manipulation in R.
varInfo = read_excel("../Data/STAR File Documentation.xls", sheet = "THORACIC DATA", skip =1, col_types = c(rep("text",3), rep("date",2), rep("guess",7)) )
# fix an error in the variable information document: there were two start dates (01-Oct-87 and 01-Oct-90) for the variable CMV_DON. We assign 01-Oct-90 as the start date.
varInfo$`VAR START DATE`[51] <- "1990-10-01"
# pull names of variables that are of type CHAR, or of type NUM but with numeric category codes (per the SAS ANALYSIS FORMAT column)
factorVars = varInfo %>%
dplyr::filter(`DATA TYPE`=="CHAR" | (`DATA TYPE`=="NUM" & !is.na(`SAS ANALYSIS FORMAT`)) ) %>%
pull(`VARIABLE NAME`)
df[, factorVars] %<>% lapply(as.character) # converting factors to character to facilitate preprocessing
saveRDS(factorVars, "../Results/factorVars.RDS")
Based on the analysis above, we have identified 312 categorical variables (note that R had previously identified only the character variables as categorical). The names of the 312 variables can be accessed in the factorVars.RDS file. Furthermore, we applied as.character() to these variables to facilitate their preprocessing.
In the code chunk below, we first filter to keep adult patients since our study focuses only on adult transplantations. Then we remove individuals whose weights fall in the lowest 0.01% of adult weights in the data (less than 30 lbs) or whose heights fall in the lowest 0.01% of adult heights in the data (less than 122 cm).
df <- df %>% subset(AGE>=18) %>% # filtering to keep only Adults
# we excluded too light or too short people in the following code
subset(WGT_KG_DON_CALC >= quantile(WGT_KG_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(WGT_KG_TCR >= quantile(WGT_KG_TCR, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_DON_CALC >= quantile(HGT_CM_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_TCR >= quantile(HGT_CM_TCR, 0.0001, na.rm = TRUE))
In the code chunk below, we first filter out any observations with missing GSTATUS or GTIME. Our rationale for doing this prior to removing uninformative/irrelevant/unreliable variables is that these two variables are used in computing our transplantation OUTCOME and, hence, we did not want to lose them. We then remove any variables that meet any of the following criteria:
varsVec = read_rds('../Data/dropVars.RDS') # a vector of 52 vars to be dropped
alleles = c("DA1","DA2","RA1","RA2","DB1","DB2","RB1","RB2","RDR1","RDR2","DDR1","DDR2")
VarstoBeDropped = varInfo %>%
dplyr::filter(`VAR START DATE` >= '2000-01-01' |
!is.na(`VAR END DATE`) |
str_detect(FORM, "TRF/TRR|TRR/TRF-CALCULATED|TRR/TRF|TRF") |
str_detect(`FORM SECTION`, "POST TRANSPLANT CLINICAL INFORMATION") |
str_detect(DESCRIPTION, "IDENTIFIER|DATE") |
`VARIABLE NAME` %in% varsVec |
`VARIABLE NAME` %in% alleles |
`VARIABLE NAME` %in% c("DEATH_CIRCUM_DON","DEATH_MECH_DON", "TX_YEAR") ) %>%
pull(`VARIABLE NAME`)
# retain (i.e., remove from the drop list) variables corresponding to prior cardiac surgery (non-transplant) or lung surgery
VarstoBeDropped <- setdiff(VarstoBeDropped, c("PRIOR_CARD_SURG_TCR", "PRIOR_CARD_SURG_TRR", "PRIOR_CARD_SURG_TYPE_OSTXT_TCR", "PRIOR_CARD_SURG_TYPE_OSTXT_TRR",
"PRIOR_CARD_SURG_TYPE_TCR", "PRIOR_CARD_SURG_TYPE_TRR", "PRIOR_LUNG_SURG_TYPE_OSTXT_TRR"))
df %<>% dplyr::filter(!is.na(GSTATUS) & !is.na(GTIME) ) %>% # remove obs missing graft status/time
select(!all_of(VarstoBeDropped)) %>% # exclude variables to be dropped
discard(~sum(is.na(.x))/length(.x)* 100 >= 90) %>% # remove vars with >= 90 % missing
dropNumericMissing(percent = 30) # drop numeric columns w/ >= 30% missing (see custom_functions.R)
In the code chunk below, we create several derived variables, including PVR, ECMO, CARD_SURG, BMI_CHNG, WGT_CHNG, HGT_CHNG, AGE_DIFF, BMI_DIFF, ETH_MAT, GEN_MAT, the donor medication indicators (ANCEF, DOPAMINE, HEPARIN, ZOSYN), and our OUTCOME variable.
After these variables were created, we dropped both GTIME and GSTATUS, as well as the four variables capturing the medications prescribed/applied to the deceased donor (i.e., PT_OTH1_OSTXT_DON – PT_OTH4_OSTXT_DON). Furthermore, we converted all indicator columns into characters.
df %<>% mutate(ISCHTIME = ISCHTIME*60) %>% # convert ischemic time from hours to minutes
  mutate(PVR = (HEMO_PA_MN_TRR - HEMO_PCW_TRR)*79.72/HEMO_CO_TRR) %>% # pulmonary vascular resistance: transpulmonary gradient / cardiac output
  mutate(ECMO = ifelse(ECMO_TCR + ECMO_TRR == 0, 0, 1)) %>% # ECMO at registration (TCR) and/or at transplant (TRR)
mutate(CARD_SURG = if_else(PRIOR_CARD_SURG_TCR=="Y"|PRIOR_CARD_SURG_TRR=="Y", true="Y", false=ifelse(PRIOR_CARD_SURG_TCR=="N"&PRIOR_CARD_SURG_TRR=="N", "N", NA))) %>%
mutate(BMI_CHNG = 100*(BMI_CALC- INIT_BMI_CALC)/INIT_BMI_CALC) %>%
mutate(WGT_CHNG = 100*(WGT_KG_CALC - INIT_WGT_KG_CALC)/INIT_WGT_KG_CALC) %>%
mutate(HGT_CHNG = 100*(HGT_CM_CALC - INIT_HGT_CM_CALC)/INIT_HGT_CM_CALC) %>%
mutate(AGE_DIFF = abs(AGE - AGE_DON)) %>%
mutate(BMI_DIFF = abs(BMI_CALC - BMI_DON_CALC)) %>%
mutate(ETH_MAT = if_else(ETHCAT == ETHCAT_DON, true = "Y", false = "N")) %>%
mutate(GEN_MAT = if_else(GENDER == GENDER_DON, true = "Y", false = "N")) %>%
mutate(ANCEF = if_else(str_detect(PT_OTH1_OSTXT_DON, 'ANCEF') |
str_detect(PT_OTH2_OSTXT_DON, 'ANCEF') |
str_detect(PT_OTH3_OSTXT_DON, 'ANCEF') |
str_detect(PT_OTH4_OSTXT_DON, 'ANCEF'),
true = "Y", false = "N") ) %>%
mutate(DOPAMINE = if_else(str_detect(PT_OTH1_OSTXT_DON, 'DOPAMINE') |
str_detect(PT_OTH2_OSTXT_DON, 'DOPAMINE') |
str_detect(PT_OTH3_OSTXT_DON, 'DOPAMINE') |
str_detect(PT_OTH4_OSTXT_DON, 'DOPAMINE'),
true = "Y", false = "N") ) %>%
mutate(HEPARIN = if_else(str_detect(PT_OTH1_OSTXT_DON, 'HEPARIN') |
str_detect(PT_OTH2_OSTXT_DON, 'HEPARIN') |
str_detect(PT_OTH3_OSTXT_DON, 'HEPARIN') |
str_detect(PT_OTH4_OSTXT_DON, 'HEPARIN'),
true = "Y", false = "N") ) %>%
mutate(ZOSYN = if_else(str_detect(PT_OTH1_OSTXT_DON, 'ZOSYN') |
str_detect(PT_OTH2_OSTXT_DON, 'ZOSYN') |
str_detect(PT_OTH3_OSTXT_DON, 'ZOSYN') |
str_detect(PT_OTH4_OSTXT_DON, 'ZOSYN'),
true = "Y", false = "N") ) %>%
rowwise() %>%
mutate(OUTCOME = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 1)) %>% # see custom_functions.R
select(-c(GSTATUS, GTIME, starts_with("PT_OTH")))
df %<>% ungroup() %>%
mutate_at(c('ECMO', 'CARD_SURG', 'ETH_MAT', 'GEN_MAT', 'ANCEF', 'DOPAMINE', 'HEPARIN', 'ZOSYN', 'OUTCOME'), as.character)
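The recodeGSTATUS() function is defined in custom_functions.R, which is not reproduced on this page. For readers without that file, the sketch below shows one plausible implementation; it assumes GSTATUS = 1 denotes graft failure, GTIME is measured in days, the outcome labels are "FAILURE"/"SURVIVAL", and observations censored before the target window are set to missing. The actual function may differ in its labels and edge cases.

```r
# Hypothetical sketch of recodeGSTATUS(); see custom_functions.R for the
# actual definition used in the analysis. Labels and units are assumptions.
recodeGSTATUS_sketch <- function(gStatus, gTime, targetYear = 1) {
  cutoff <- 365.25 * targetYear          # target window in days (assumed)
  if (gStatus == 1 && gTime <= cutoff) {
    "FAILURE"                            # graft failed within the window
  } else if (gTime > cutoff) {
    "SURVIVAL"                           # graft functioning past the window
  } else {
    NA_character_                        # censored before the window ended
  }
}
```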
Despite the application of nearZeroVar() from the caret package, several of the remaining character/factor variables still exhibit a large number of unique values. The purpose of the code chunk below is to quantify the variability in these remaining character/factor variables.
df %>% select(-ID) %>% summarise_if(is.character, n_distinct, na.rm = TRUE) -> numLevels
If we exclude the ID column, the number of distinct values in each character variable (equivalently, the number of non-NA levels if these variables were encoded as factors) has the following summary statistics:
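These summary statistics can be obtained directly from numLevels (a small helper chunk added here for convenience):

```r
# Five-number summary (plus mean) of the distinct-value counts across the
# remaining character variables
summary(unlist(numLevels))
```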
Accordingly, we will have to examine how to reduce the number of levels within each variable before the machine learning stage; we do not want to generate a very large number of dummy variables, so we will group some of these levels by capitalizing on similarities in how they are interpreted.
While the recipes package provides an excellent pipeline for data preprocessing, we elected to go through the grouping process manually since this allows us to (a) utilize the information in the factor levels to combine similar categories (whether due to location, education status, or medical diagnosis codes); and (b) avoid integrating the recipes package into the training of the machine learning models, which would require training the models repeatedly for each recipe per the documentation of Kuhn (2019). In the code chunk below, we provide the details of how we regrouped several values/levels of each categorical/character vector with the hope of improving both the efficiency and the prediction performance of the different machine learning models. In addition, we used recode() to rename some categories since R can produce errors during one-hot encoding when a category name is numeric; therefore, some of the operations performed below merely change the name of a category to avoid future errors when processing the data. After regrouping, we performed an additional step of variable elimination using the caret nearZeroVar() function. We tuned the function such that a variable is flagged when (a) the ratio of the frequency of the most common value to the frequency of the second most common value is \(\small \ge\) 200; and (b) the percentage of unique values in the sample is less than 0.01%.
df$ABO %<>% recode(A = 'A', A1 = 'A', A2 = 'A', B = 'B', O = 'O', AB = 'AB', A1B = 'AB', A2B = 'AB')
df$ABO_DON %<>% recode(A = 'A', A1 = 'A', A2 = 'A', B = 'B', O = 'O', AB = 'AB', A1B = 'AB', A2B = 'AB')
df$ABO_MAT %<>% recode(`1` = "IDENTICAL", `2` = "NOT_IDENTICAL", `3` = "NOT_IDENTICAL")
df$AMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED")
df$BMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED")
df$BRONCHO_LT_DON %<>% recode(`1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL", `4` = "ABNORMAL", `5` = "ABNORMAL",
`6` = "ABNORMAL", `7` ="OTHER", `998` = "UNKNOWN")
df$BRONCHO_RT_DON %<>% recode(`1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL", `4` = "ABNORMAL", `5` = "ABNORMAL",
`6` = "ABNORMAL", `7` ="OTHER", `998` = "UNKNOWN")
df$CHEST_XRAY_DON %<>% recode(`0` = "UNKNOWN", `1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL_SINGLE",
`4` = "ABNORMAL_SINGLE", `5` = "ABNORMAL_BOTH",
`998` = "UNKNOWN", `999` = "UNKNOWN")
df$COD_CAD_DON %<>% recode(`1` = 'ANOXIA', `2` = 'CEREBROVASCULAR_STROKE', `3` = 'HEAD_TRAUMA',
`4` = 'OTHER', `999` = 'OTHER', Unknown = 'UNKNOWN')
df$CORONARY_ANGIO %<>% recode(`1` = "NO", `2` = "YES_NORMAL", `3` = "YES_ABNORMAL")
df$CMV_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$DIAB %<>% recode(`1` = 'NO', `2` = 'ONE', `3` = 'TWO', `4` = 'OTHER', `5` = 'OTHER', `998` = 'UNKNOWN')
df$DIAG %<>% recode(`1000` = 'DILATED_MYOPATHY_IDI', `1001` = 'DILATED_MYOPATHY_OTH',
`1002` = 'DILATED_MYOPATHY_OTH', `1003` = 'DILATED_MYOPATHY_OTH',
`1004` = 'DILATED_MYOPATHY_OTH', `1005` = 'DILATED_MYOPATHY_OTH',
`1006` = 'DILATED_MYOPATHY_OTH', `1007` = 'DILATED_MYOPATHY_ISC',
`1049` = 'DILATED_MYOPATHY_OTH',
`1200` = 'CORONARY', `999` = 'OTHER', `1999` = 'UNKNOWN')
df$DRMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED")
df$EBV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$EDUCATION %<>% recode(`1` = 'OTHER', `2` = 'GRADE', `3` = 'HIGH', `4` = 'COLLEGE', `5` = 'UNIVERSITY',
`6` = 'UNIVERSITY', `996` = 'OTHER', `998` = 'UNKNOWN')
df$ETHCAT %<>% recode(`1` = 'WHITE', `2` = 'BLACK', `4` = 'HISPANIC', `5` = 'OTHER', `6` = 'OTHER',
`7` = 'OTHER', `9` = 'OTHER', `998` = 'UNKNOWN')
df$ETHCAT_DON %<>% recode(`1` = 'WHITE', `2` = 'BLACK', `4` = 'HISPANIC', `5` = 'OTHER', `6` = 'OTHER',
`7` = 'OTHER', `9` = 'OTHER', `998` = 'UNKNOWN')
df$FUNC_STAT_TCR %<>% recode(`1` = "ABLE", `2` = "ASSISTED", `3` = "DISABLE", `996` = "OTHER", `998` = "UNKNOWN",
`2010` = "DISABLE", `2020` = "DISABLE", `2030` = "DISABLE", `2040` = "DISABLE",
`2050` = "ASSISTED", `2060` = "ASSISTED" , `2070` = "ASSISTED",
`2080` = "ABLE", `2090` = "ABLE", `2100` = "ABLE")
df$FUNC_STAT_TRR %<>% recode(`1` = "ABLE", `2` = "ASSISTED", `3` = "DISABLE", `996` = "OTHER", `998` = "UNKNOWN",
`2010` = "DISABLE", `2020` = "DISABLE", `2030` = "DISABLE", `2040` = "DISABLE",
`2050` = "ASSISTED", `2060` = "ASSISTED" , `2070` = "ASSISTED",
`2080` = "ABLE", `2090` = "ABLE", `2100` = "ABLE")
df$HBV_CORE %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HBV_CORE_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HBV_SUR_ANTIGEN %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HCV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HEP_C_ANTI_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HIST_DIABETES_DON %<>% recode(`1` = 'NO', `2` = 'YES', `3` = 'YES', `4` = 'YES' , `5` = 'YES', `998` = 'UNKNOWN')
df$HIV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$HLAMIS %<>% recode(`0` = 'LOWEST', `1` = 'LOWEST', `2` = 'LOWEST', `3` = 'LOW',
`4` = 'MEDIUM', `5` = 'HIGH', `6` = 'HIGHEST')
df$HTLV2_OLD_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN",
P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN")
df$INIT_STAT %<>% recode(`1010` = 'ONE', `1020` = 'ONE', `1030` = 'TWO', `1090` ='ONE',
`1999` = 'TEMPORARILY_INACTIVE', `2010` = 'ONE', `2020` = 'ONE', `2030` = 'TWO',
`2090` = 'ONE', `2999` = 'TEMPORARILY_INACTIVE', `7010` = 'ACTIVE',
`7999` = 'TEMPORARILY_INACTIVE')
df$INOTROPES_TCR %<>% recode(`0` = "NO", `1` = "YES")
df$INOTROPES_TRR %<>% recode(`0` = "NO", `1` = "YES")
df$MED_COND_TRR %<>% recode(`1` = "ICU_HOSPITALIZED", `2` = "HOSPITALIZED", `3` = "NOT_HOSPITALIZED")
df$NUM_PREV_TX %<>% recode(`0` = 'ZERO', .default = 'NON_ZERO', .missing = NA_character_)
df$PRI_PAYMENT_TCR %<>% recode(`1` = "PRIVATE", `2` ="MEDICAID", `3` = "MEDICARE_FREE", `4` = "PUBLIC_OTHER",
`5` = "PUBLIC_OTHER", `6` = "PUBLIC_OTHER", `7` = "PUBLIC_OTHER", `8` = "OTHER",
`9` = "OTHER", `10` = "OTHER", `11` = "OTHER", `12` = "OTHER", `13` = "PUBLIC_OTHER",
`14` = "OTHER", `15` = "UNKNOWN")
df$PRI_PAYMENT_TRR %<>% recode(`1` = "PRIVATE", `2` ="MEDICAID", `3` = "MEDICARE_FREE", `4` = "PUBLIC_OTHER",
`5` = "PUBLIC_OTHER", `6` = "PUBLIC_OTHER", `7` = "PUBLIC_OTHER", `8` = "OTHER",
`9` = "OTHER", `10` = "OTHER", `11` = "OTHER", `12` = "OTHER", `13` = "PUBLIC_OTHER",
`14` = "OTHER", `15` = "UNKNOWN")
df$PROC_TY_HR %<>% recode(`1` = "BICAVAL", `2` = "TRADITIONAL")
df$PULM_INF_DON %<>% recode(`0` = "NO", `1` = "YES")
df$REGION %<>% recode(`1` = "NORTH_EAST", `2` = "NORTH_EAST", `3` = "SOUTH_EAST", `4` = "SOUTH_EAST", `5` = "WEST",
`6` = "WEST", `7` = "MIDWEST", `8` = "MIDWEST", `9` = "NORTH_EAST", `10` = "MIDWEST",
`11` = "SOUTH_EAST")
df$SHARE_TY %<>% recode(`3` = "LOCAL", `4` = "REGIONAL", `5` = "NATIONAL")
nzvColLoc = nearZeroVar(df, freqCut = 200, uniqueCut = 0.01)
df = df[,-nzvColLoc]
In this section, we analyze the results we obtained for all scenarios using hierarchical regression.
In the following code chunk, we read the result data for all scenarios.
data <- read.csv("https://raw.githubusercontent.com/blindedResearch/transplant/master/all_scenarios.csv")
Next, we check whether we obtained all values of the performance measures for all scenarios by studying the distribution of missing values in the data. We then remove the cases with missing values since we cannot use them in further analysis.
We study the distribution of missing values in the results data below.
plot_missing(data)
plot_histogram(data[,c(10:13,15)])
We print the missing cases in the following table. We note that all 17 missing cases correspond to the KPLSR algorithm.
na_vec=which(is.na(data$auc))
datatable(data[na_vec,c(2,3,4,5,6,7)])#prints the cases that are missing
We glimpse the cases that are not missing.
head(data[-na_vec,c(2,3,4,5,6,7)])#glimpses the cases that are not missing
Below we take a look at the complete data.
nonconverge_cases=data[na_vec,]
complete=data[-na_vec,]
plot_missing(complete)
We explicitly removed the NA’s from the data, creating a data frame called complete.
plot_histogram(complete[,c(10:13,15)])
We will use “hierarchical regression”: a simple technique of entering variables into a regression in blocks so that we can see the “net effect” of each block on the response. It is a popular method in explanatory modeling, especially in the behavioral sciences. There, researchers first enter demographics or variables that have not been manipulated and are outside the researcher's control (e.g., gender, race); they may then enter existing variables that are known to affect the response but are not under study in the current research; finally, they enter the newly proposed research variables. The goal is to show that the newly proposed variables “explain” a large percentage of additional variation.
We believe that this audience is used to seeing explanatory analyses presented in this way, and it is actually fairly informative.
In the examples below, we choose three blocks: (1) the imputation methods (numeric and categorical) and the encoding method; (2) the subsampling strategy; and (3) the feature (variable) selection method and the machine learning algorithm. Our reasoning is that imputation and encoding are done first, during data cleaning. Next, subsampling is done prior to any analysis. Finally, the algorithm and variable selection are done in tandem, with choices made simultaneously.
We chose to consider AUC and G-mean since both summarize performance well for the unbalanced case and give different views of the fit. The results differ slightly according to the response.
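For reference, the G-mean is typically defined as the geometric mean of sensitivity and specificity, which makes it informative for unbalanced outcomes. A minimal sketch of the computation, using an illustrative (made-up) confusion matrix rather than our actual results:

```r
# G-mean = sqrt(sensitivity * specificity); confusion matrix values are made up
cm <- matrix(c(80, 20,    # row 1, predicted positive: TP, FP
               10, 890),  # row 2, predicted negative: FN, TN
             nrow = 2, byrow = TRUE)
sens  <- cm[1, 1] / (cm[1, 1] + cm[2, 1])  # TP / (TP + FN)
spec  <- cm[2, 2] / (cm[2, 2] + cm[1, 2])  # TN / (TN + FP)
gmean <- sqrt(sens * spec)
```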
In the following code chunk, we obtain the Anova II SS using Block 1.
block1auc=formula(auc~(imputation_num+imputation_cat+encoding)^2)
aucmod1=lm(block1auc,data=complete)
#write.csv(Anova(aucmod1,type="II"),file="mod1anova2.csv")
Anova(aucmod1,type="II")
From the above, we see that the interaction between numeric and categorical imputation is significant.
print(paste("R-square = ", round(summary(aucmod1)$r.square, 4)))
## [1] "R-square = 0.01"
Despite the statistical significance of the imputation method interaction, we can see that multiple R-square is very low. We must keep in mind that the error degrees of freedom is 10,770. Power is high, and effect sizes are quite small.
We report the individual coefficients in the following table.
b=data.frame(coefficients(aucmod1))
b$stderr=summary(aucmod1)$coefficients[,2]
b$t=summary(aucmod1)$coefficients[,3]
b$pval=summary(aucmod1)$coefficients[,4]
table1=round(b, 4)
colnames(table1)=c("B","Std Error","t","p-val")
#write.csv(table1,file="mod1table.csv")
datatable(table1)
In the following code chunk, we obtain the Anova II SS using Block 2.
block2auc=formula(auc~(imputation_num+imputation_cat+encoding
+subsampling)^2)
aucmod2=lm(block2auc,data=complete)
#write.csv(Anova(aucmod2,type="II"),file="mod2anova2.csv")
Anova(aucmod2,type="II")
From the above, we see that the numerical and categorical imputation interaction remains significant in this model. In addition, the main effect for subsampling is significant.
print(paste("R-square = ", round(summary(aucmod2)$r.square, 4)))
## [1] "R-square = 0.2537"
waldtest(aucmod1,aucmod2)
r2diff12=round(summary(aucmod2)$r.squared-summary(aucmod1)$r.squared, 4)
print(paste("r-squared difference = ",r2diff12))
## [1] "r-squared difference = 0.2437"
We can see quite a change in the R-square from the addition of resampling/subsampling.
The Wald test assesses the significance of the additional information accounted for by the block of predictors added to a model. We have statistical significance with F = 146.25 on 24 numerator and 10,770 denominator degrees of freedom (p < 2.2e-16). This indicates a significant drop in SSE from including the subsampling method in the model that already included the encoding and imputation methods (categorical and numerical); that is, subsampling is (statistically) important to the AUC of the model.
In addition, the increase in R-square is 24.37% when subsampling is included, compared to the model with only the encoding and imputation methods. Note that this is explanatory in terms of the behavior of the experimental data.
We report the individual coefficients in the following table.
b=data.frame(coefficients(aucmod2))
b$stderr=summary(aucmod2)$coefficients[,2]
b$t=summary(aucmod2)$coefficients[,3]
b$pval=summary(aucmod2)$coefficients[,4]
table2=round(b, 4)
colnames(table2)=c("B","Std Error","t","p-val")
#write.csv(table2, file="mod2table.csv")
datatable(table2)
In the following code chunk, we obtain the Anova II SS using Block 3.
block3auc=formula(auc~(imputation_num+imputation_cat+encoding
+subsampling+feature_selection+algorithm)^2)
aucmod3=lm(block3auc,data=complete)
#summary(aucmod3)
#write.csv(Anova(aucmod3,type="II"), file="mod3anova2.csv")
Anova(aucmod3,type="II")
The results are quite complex. All two-way interactions are significant except for interactions that include encoding. It seems that encoding is not relevant to AUC, but all other variables interact.
print(paste("R-square = ", round(summary(aucmod3)$r.square, 4)))
## [1] "R-square = 0.8941"
waldtest(aucmod2,aucmod3)
r2diff23=round(summary(aucmod3)$r.squared-summary(aucmod2)$r.squared,4)
print(paste("r-squared difference = ",r2diff23))
## [1] "r-squared difference = 0.6404"
The Wald test again assesses the significance of the additional information accounted for by the added block of predictors. We have statistical significance with F = 554.32 on 116 numerator and 10,746 denominator degrees of freedom (p < 2.2e-16). This indicates a significant drop in SSE from including the ML algorithm and variable selection methods, along with the corresponding interaction terms.
The difference in R-square between the Block 2 and Block 3 models is 64%. This suggests that the algorithm and variable selection method explain more of the observed variation in AUC than the Block 1 and Block 2 factors.
We report the individual coefficients in the following table.
b=data.frame(coefficients(aucmod3))
b$stderr=summary(aucmod3)$coefficients[,2]
b$t=summary(aucmod3)$coefficients[,3]
b$pval=summary(aucmod3)$coefficients[,4]
table3=round(b, 4)
colnames(table3)=c("B","Std Error","t","p-val")
#write.csv(table3,file="mod3table.csv")
datatable(table3)
Even though the two-way interactions are statistically significant, many are not practically significant. We recommend computing a common measure of “effect size” known as partial \(\eta_p^2\). This is equivalent to what statisticians call partial \(R^2\) and is computed as \[\eta_p^2=\frac{SS_{factor}}{SS_{factor}+SS_{error}}.\]
anovaauc=Anova(aucmod3,type="II")
error=anovaauc$`Sum Sq`[nrow(anovaauc)]
p.eta.sq=as.data.frame(anovaauc$`Sum Sq`/(error+anovaauc$`Sum Sq`))
rownames(p.eta.sq)=rownames(anovaauc)
colnames(p.eta.sq)=c("effect")
#write.csv(p.eta.sq,file="etasq.csv")
datatable(round(p.eta.sq,4))
In the behavioral sciences, effects smaller than 0.02 are typically considered negligible. Based on this analysis, we will retain the following terms in our model: the main effects imputation_num, imputation_cat, subsampling, feature_selection, and algorithm; and the interactions imputation_num:imputation_cat, imputation_num:feature_selection, subsampling:feature_selection, subsampling:algorithm, and feature_selection:algorithm.
In the following code chunk, we obtain the Anova II SS using Block 4.
block4auc=formula(auc~imputation_num+imputation_cat+subsampling+feature_selection+algorithm
+imputation_num*imputation_cat+
imputation_num*feature_selection+
subsampling*feature_selection+
subsampling*algorithm+
feature_selection*algorithm)
aucmod4=lm(block4auc,data=complete)
#summary(aucmod4)
#write.csv(Anova(aucmod4,type="II"),file="mod4anova2.csv")
Anova(aucmod4,type="II")
print(paste("R-square = ", round(summary(aucmod4)$r.square, 4)))
## [1] "R-square = 0.8903"
waldtest(aucmod4,aucmod3)
r2diff43=round(summary(aucmod4)$r.squared-summary(aucmod3)$r.squared,4)
print(paste("r-squared difference = ",r2diff43))
## [1] "r-squared difference = -0.0039"
While there is a statistically significant difference between Models 3 and 4 (F = 5.3301 on 73 and 10,711 df, p < 2.2e-16), the difference in the \(R^2\) values is less than 1%.
If we wish to explain the interaction terms, Model 4 might be the easier one to interpret.
We report the individual coefficients in the following table.
b=data.frame(coefficients(aucmod4))
b$stderr=summary(aucmod4)$coefficients[,2]
b$t=summary(aucmod4)$coefficients[,3]
b$pval=summary(aucmod4)$coefficients[,4]
table4=round(b, 4)
colnames(table4)=c("B","Std Error","t","p-val")
#write.csv(table4,file="mod4table.csv")
datatable(table4)
Here we conduct pairwise comparisons among the factor interactions using Tukey-adjusted contrasts based on Model 4.
catnum=lsmeans(aucmod4,pairwise~imputation_cat:imputation_num,adjust="tukey")
plot(catnum[[2]])+theme_minimal()
The above plot shows the Tukey-adjusted least-squares mean pairwise contrasts with default 95% family-wise confidence intervals. Although there are some statistically significant differences (indicated by intervals that do not cross zero), the differences in AUC are practically unimportant. It does, however, appear that dropping both the numerical and categorical missing values is not advisable.
numfs=lsmeans(aucmod4,pairwise~imputation_num:feature_selection,adjust="tukey")
plot(numfs[[2]])+theme_minimal()
This is difficult to interpret, but it appears that the Down subsampling method, coupled with median imputation or dropping the missing values altogether, works best. ROSE is not advisable.
In the following, we show the pairwise comparisons among the interactions of subsampling strategies and machine learning algorithms. Since there are 990 comparisons, we show 45 per figure for readability.
algresamp=lsmeans(aucmod4,pairwise~algorithm:subsampling,adjust="tukey")
#45 choose 2 = 990 cases.
for (i in 0:21){
assign(paste0("p",(i+1)),plot(algresamp[[2]][(i*45+1):((i+1)*45)])+theme_minimal())
# We introduce a tab for each participant using tabset
cat('####',paste0("P", (i+1)), "{-}",'\n')
print(get(paste0("p",(i+1))))
cat('\n \n')
}
In this section, we examine the interaction effects on the AUC values.
theme_set(theme_sjplot())
plot_model(aucmod4, type="pred", terms=c("imputation_cat", "imputation_num"), line.size = 1.5, title="", axis.title = c("Categorical Variable Imputation","AUC"), legend.title = "Numerical Variable Imputation")+font_size(axis_title.x=14, axis_title.y=14,labels.x=12, labels.y=12)+ylim(0.35, 0.65)+theme(legend.position="top")
plot_model(aucmod4, type="pred", terms=c("imputation_num", "feature_selection"), line.size = 1.5, title="", axis.title=c("Numerical Variable Imputation", "AUC"),
legend.title="Feature Selection")+font_size(axis_title.x=14, axis_title.y=14,labels.x=12, labels.y=12)+ylim(0.35, 0.65)+theme(legend.position="top")
plot_model(aucmod4, type="pred", terms=c("subsampling","feature_selection"), line.size = 1.5, title="", axis.title=c("Subsampling Strategy", "AUC"),
legend.title="Feature Selection")+font_size(axis_title.x=14, axis_title.y=14,labels.x=12, labels.y=12)+ylim(0.35, 0.65)+theme(legend.position="top")
plot_model(aucmod4, type="pred", terms=c("algorithm", "subsampling"), line.size = 1.5, title="",axis.title=c("Machine Learning Algorithm", "AUC"),
legend.title="Subsampling Strategy")+font_size(axis_title.x=14, axis_title.y=14,labels.x=12, labels.y=12)+ylim(0.35, 0.65)+theme(legend.position="top")
plot_model(aucmod4, type="pred", terms=c("algorithm","feature_selection"), line.size = 1.5, title="", axis.title=c("Machine Learning Algorithm", "AUC"),
legend.title="Feature Selection")+font_size(axis_title.x=14, axis_title.y=14,labels.x=12, labels.y=12)+ylim(0.35, 0.65) +theme(legend.position="top")
warnings()
Chen CK, Manlhiot C, Mital S, Schwartz SM, Van Arsdell GS, Caldarone C, McCrindle BW, Dipchand AI (2018) Prelisting predictions of early postoperative survival in infant heart transplantation using classification and regression tree analysis. Pediatric transplantation 22(2):e13105.
Dag A, Oztekin A, Yucel A, Bulur S, Megahed FM (2017) Predicting heart transplantation outcomes through data analytics. Decision Support Systems 94:42–52.
Dag A, Topuz K, Oztekin A, Bulur S, Megahed FM (2016) A probabilistic data-driven framework for scoring the preoperative recipient-donor heart transplant survival. Decision Support Systems 86:1–12.
De Jonge E, Van Der Loo M (2013) An introduction to data cleaning with R (Statistics Netherlands, The Hague/Heerlen).
Guven G, Brankovic M, Constantinescu AA, Brugts JJ, Hesselink DA, Akin S, Struijs A, et al. (2018) Preoperative right heart hemodynamics predict postoperative acute kidney injury after heart transplantation. Intensive care medicine 44(5):588–597.
Kuhn M (2019) Using recipes with train. In The caret package documentation (Chapter 12).
Medved D, Ohlsson M, Höglund P, Andersson B, Nugues P, Nilsson J (2018) Improving prediction of heart transplantation outcome using deep learning techniques. Scientific Reports 8(1):1–9.
Parham P (2014) The immune system (Garland Science).
Shmueli G (2010) To explain or to predict? Statistical Science 25(3):289–310.
Vega E, Schroder J, Nicoara A (2017) Postoperative management of heart transplantation patients. Best Practice & Research Clinical Anaesthesiology 31(2):201–213.
Weisdorf D, Spellman S, Haagenson M, Horowitz M, Lee S, Anasetti C, Setterholm M, et al. (2008) Classification of HLA-matching for retrospective analysis of unrelated donor transplantation: Revised definitions to predict survival. Biology of Blood and Marrow Transplantation 14(7):748–758.
Zhang K, Sheu R, Zimmerman NM, Alfirevic A, Sale S, Gillinov AM, Duncan AE (2019) A comparison of global longitudinal, circumferential, and radial strain to predict outcomes after cardiac surgery. Journal of cardiothoracic and vascular anesthesia 33(5):1315–1322.